Email Processing & Classification
This document explains the email processing and classification system that powers the notification pipeline for general notices and updates. It covers:
Notice classification algorithms using LLM-based prompts
Email parsing pipeline (headers, bodies, metadata)
Notice types, priority, filtering, and content enrichment
Integration with email clients, authentication, and batch processing
Examples of processed data structures, accuracy considerations, and handling of malformed or suspicious emails
The email processing system centers around a LangGraph pipeline that classifies incoming emails and extracts structured notices. It integrates with:
Google Groups client for fetching unread emails
LLM prompts for classification and extraction
Database persistence for notices and policy documents
Notification dispatch for Telegram and web push
Architecture diagram (components and data flow):
GoogleGroupsClient (GG): IMAP-based email fetch
Processing: EmailNoticeService (ENS), LangGraph pipeline; NoticeFormatterService (NFS), classification and formatting; PlacementPolicyService (PPS), policy extraction
Persistence: DatabaseService (DB), MongoDB Notices/Jobs/Policies; DBClient (DBC), MongoDB connection
Delivery: NotificationService (NS), dispatch to channels; TelegramService (TG); WebPushService (WP)
Flow: GG → ENS; ENS → NFS; ENS → PPS; NFS → DB; PPS → DB; DB → NS; NS → TG; NS → WP
EmailNoticeService: Orchestrates LLM-based classification and extraction for general notices; handles placement policy detection and fallbacks.
NoticeFormatterService: Provides classification, enrichment, and formatting for notices and job postings.
GoogleGroupsClient: Decodes email headers, extracts forwarded metadata, and normalizes dates to IST.
PlacementPolicyService: Specialized extraction for placement policy updates into structured Markdown with TOC generation.
DatabaseService: Persists notices, jobs, policies, and manages deduplication and stats.
NotificationService: Routes formatted notices to Telegram and web push channels.
Webhook server: Exposes health, stats, and notification endpoints for external integrations.
The email processing pipeline follows a LangGraph workflow:
Input: Unread email IDs fetched from Google Groups
Nodes: classify → extract_notice → validate → display_results
Decision edges: classify → extract_notice if relevant; retry extraction on validation errors; skip if not a valid notice
Outputs: Persisted NoticeDocument or policy updates
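The decision edges above can be sketched as a plain routing function. This is a minimal illustration, not the project's LangGraph code: the state keys (`is_relevant`, `is_notice`, `validation_errors`, `retry_count`) and the function name `next_node` are hypothetical, and only the node names and the two-retry limit come from this document.

```python
from typing import Optional, TypedDict

MAX_RETRIES = 2  # retry limit stated in this document


class PipelineState(TypedDict):
    # Hypothetical state keys, chosen for illustration only.
    is_relevant: bool
    is_notice: Optional[bool]  # None until extraction has run
    validation_errors: list[str]
    retry_count: int


def next_node(current: str, state: PipelineState) -> str:
    """Return the name of the node to run after `current`, or "end"."""
    if current == "classify":
        return "extract_notice" if state["is_relevant"] else "end"
    if current == "extract_notice":
        # Skip emails that extraction rejected as not being a valid notice.
        return "validate" if state["is_notice"] else "end"
    if current == "validate":
        if not state["validation_errors"]:
            return "display_results"
        if state["retry_count"] < MAX_RETRIES:
            return "extract_notice"  # retry extraction on validation errors
        return "end"
    return "end"
```

In a LangGraph setup, a function of this shape would typically be attached as a conditional edge; here it is kept free of dependencies so the branching is easy to test in isolation.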
EmailNoticeService: Classification and Extraction
Classification: The classify node marks every email as relevant and delegates the actual decision to the LLM, which rejects irrelevant, spam, and placement-offer emails.
Extraction: Uses a structured prompt to return JSON with fields for title, content, type, source, deadlines, links, and type-specific fields.
Validation: Ensures minimum length for title/content and presence of type.
Retry logic: Retries extraction up to two times on validation errors.
Policy detection: Detects placement policy updates and triggers a secondary extraction with a specialized prompt.
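The validation and retry rules above might look like the following sketch. The minimum lengths are assumed values (the document does not state them), and `extract_with_retry` with its `extract` callback is a hypothetical stand-in for the LLM extraction call; feeding the previous errors back into the next attempt is also an assumption.

```python
from typing import Callable

MIN_TITLE_LEN = 5     # assumed threshold, not taken from the project
MIN_CONTENT_LEN = 20  # assumed threshold, not taken from the project
MAX_RETRIES = 2       # "up to two times", as stated above


def validate_notice(notice: dict) -> list[str]:
    """Return a list of validation errors; an empty list means valid."""
    errors = []
    if len((notice.get("title") or "").strip()) < MIN_TITLE_LEN:
        errors.append("title too short")
    if len((notice.get("content") or "").strip()) < MIN_CONTENT_LEN:
        errors.append("content too short")
    if not notice.get("type"):
        errors.append("type missing")
    return errors


def extract_with_retry(extract: Callable[[list[str]], dict]) -> tuple[dict, list[str]]:
    """Run `extract` once, then up to MAX_RETRIES more times while invalid.

    The previous errors are passed to `extract` so a caller could include
    them in the next prompt.
    """
    errors: list[str] = []
    notice: dict = {}
    for _ in range(1 + MAX_RETRIES):
        notice = extract(errors)
        errors = validate_notice(notice)
        if not errors:
            break
    return notice, errors
```

If the errors list is still non-empty after the last attempt, the caller can treat the email as failed rather than persisting a low-quality notice.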
NoticeFormatterService: Classification, Matching, and Formatting
Classification: Single-label classifier for categories: update, shortlisting, announcement, hackathon, webinar, job posting.
Matching: Extracts company names and fuzzy-matches against job listings for enrichment.
Formatting: Produces Telegram-ready messages with consistent structure, IST date formatting, and optional job enrichment.
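As an illustration of the matching step, the sketch below fuzzy-matches an extracted company name against known job-listing companies using the standard library's `difflib`. The document does not say which matching algorithm or threshold the service uses, so `match_company` and the 0.8 cutoff are assumptions.

```python
import difflib
from typing import Optional


def match_company(name: str, job_companies: list[str], cutoff: float = 0.8) -> Optional[str]:
    """Return the job-listing company that best matches `name`, or None."""
    # Compare case-insensitively, but return the original spelling.
    lowered = {company.lower(): company for company in job_companies}
    hits = difflib.get_close_matches(
        name.lower().strip(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[hits[0]] if hits else None
```

A notice would then be enriched with job details only when this returns a match, which keeps unmatched notices on the cheaper formatting path.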
GoogleGroupsClient: Email Parsing and Metadata Extraction
Fetches unread emails and parses multipart messages.
Extracts forwarded sender and forwarded date, normalizing to IST.
Provides safe marking/unmarking of emails as read/unread.
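Header decoding and IST normalization can be done with the standard library alone. The sketch below is one way to do it, not the client's actual code; the helper names `decode_mime_header` and `to_ist` are hypothetical, and IST is modeled as a fixed UTC+05:30 offset.

```python
from datetime import datetime, timedelta, timezone
from email.header import decode_header, make_header
from email.utils import parsedate_to_datetime

# India Standard Time has no daylight saving, so a fixed offset is enough.
IST = timezone(timedelta(hours=5, minutes=30), name="IST")


def decode_mime_header(raw: str) -> str:
    """Decode an RFC 2047 encoded header such as '=?UTF-8?B?...?='."""
    return str(make_header(decode_header(raw)))


def to_ist(date_header: str) -> datetime:
    """Parse an email Date header and convert it to IST."""
    parsed = parsedate_to_datetime(date_header)
    if parsed.tzinfo is None:
        # A "-0000" zone parses as a naive datetime; treat it as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(IST)
```

The same `to_ist` conversion would apply to a forwarded date once it has been pulled out of the message body; how the client locates that forwarded block is not covered here.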
PlacementPolicyService: Policy Extraction and Storage
Detects placement policy emails and converts them into structured Markdown with TOC.
Generates slugs and validates years from content.
Upserts policy documents into MongoDB.
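Slug generation, TOC construction, and year validation are simple text operations. The following is a hedged sketch with hypothetical helper names (`slugify`, `build_toc`, `extract_year`); the slug format, the TOC layout, and the accepted year range (2000 to 2099) are assumptions rather than the service's documented behavior.

```python
import re
from typing import Optional


def slugify(text: str) -> str:
    """Lowercase, replace runs of non-alphanumerics with '-', trim dashes."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_toc(markdown: str) -> str:
    """Build a nested Markdown list linking to each '#'-style heading."""
    lines = []
    pattern = r"^(#{1,6})[ \t]+(.+?)[ \t]*$"
    for match in re.finditer(pattern, markdown, flags=re.MULTILINE):
        level, title = len(match.group(1)), match.group(2)
        indent = "  " * (level - 1)
        lines.append(f"{indent}- [{title}](#{slugify(title)})")
    return "\n".join(lines)


def extract_year(text: str) -> Optional[int]:
    """Return the first four-digit year in 2000-2099 found in the text."""
    found = re.search(r"\b(20\d{2})\b", text)
    return int(found.group(1)) if found else None
```

The slug doubles as a stable key for the upsert, so re-processing the same policy email updates the existing document instead of creating a new one.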
DatabaseService: Persistence and Deduplication
Saves notices and jobs, deduplicates by ID, and tracks sent status.
Provides stats and retrieval helpers for unsent notices and collections.
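The deduplication contract can be shown with an in-memory stand-in. This is not the MongoDB-backed implementation: `InMemoryNoticeStore` and its method names are hypothetical, and only the ID-based dedup and the `sent_to_telegram` flag come from this document.

```python
from typing import Any


class InMemoryNoticeStore:
    """Stand-in for the MongoDB store; illustrates the dedup contract only."""

    def __init__(self) -> None:
        self._notices: dict[str, dict[str, Any]] = {}

    def save_notice(self, notice: dict[str, Any]) -> bool:
        """Insert unless the id already exists. Returns True if inserted."""
        notice_id = notice["id"]
        if notice_id in self._notices:
            return False  # duplicate: keep the first copy untouched
        self._notices[notice_id] = {**notice, "sent_to_telegram": False}
        return True

    def get_unsent(self) -> list[dict[str, Any]]:
        """Return notices that have not been sent yet."""
        return [n for n in self._notices.values() if not n["sent_to_telegram"]]

    def mark_sent(self, notice_id: str) -> None:
        self._notices[notice_id]["sent_to_telegram"] = True
```

Because the check is keyed on the ID alone, deduplication only works if the same email always yields the same notice ID.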
NotificationService: Channel Routing and Delivery
Aggregates channels (Telegram, Web Push) and broadcasts unsent notices to them.
Supports targeted channels and detailed per-channel results.
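Channel routing with per-channel results could be structured as below. The `Channel` protocol, `RecordingChannel` test double, and `broadcast` function are hypothetical names for illustration; real channels would call the Telegram and web push APIs instead of recording messages.

```python
from typing import Optional, Protocol


class Channel(Protocol):
    name: str

    def send(self, message: str) -> bool: ...


class RecordingChannel:
    """Test double that records messages instead of calling a real API."""

    def __init__(self, name: str, succeed: bool = True) -> None:
        self.name = name
        self.succeed = succeed
        self.sent: list[str] = []

    def send(self, message: str) -> bool:
        if self.succeed:
            self.sent.append(message)
        return self.succeed


def broadcast(
    message: str, channels: list[Channel], only: Optional[set[str]] = None
) -> dict[str, bool]:
    """Send to every channel, or only to those named in `only`."""
    results: dict[str, bool] = {}
    for channel in channels:
        if only is not None and channel.name not in only:
            continue
        try:
            results[channel.name] = channel.send(message)
        except Exception:
            # One failing channel should not block the others.
            results[channel.name] = False
    return results
```

Returning a result per channel lets the caller mark a notice as sent only for the channels that actually succeeded.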
Webhook Server: External Integrations
Exposes health, stats, push subscription, and notification endpoints.
Enables external systems to trigger update jobs and send notifications.
EmailNoticeService depends on GoogleGroupsClient for fetching, LLM for classification/extraction, and DatabaseService for persistence.
NoticeFormatterService depends on Notice and Job models and uses LLM for classification and extraction.
PlacementPolicyService depends on DatabaseService for upserting policy documents.
NotificationService depends on channel implementations and DatabaseService for unsent notices.
Webhook server composes NotificationService and WebPushService for integrations.
Batch processing: The CLI orchestrator fetches unread IDs and processes emails sequentially, marking as read upon success to prevent reprocessing.
Retry strategy: Up to two retries for extraction failures to improve robustness.
Lazy enrichment: Jobs are enriched only when matched by the LLM, minimizing expensive API calls.
Connection handling: GoogleGroupsClient connects per operation and disconnects afterwards to avoid stale connections.
Logging and daemon mode: Centralized logging and daemon mode reduce overhead in production.
Common issues and resolutions:
Authentication failures (IMAP): Verify placement email and app password environment variables.
LLM extraction errors: Inspect validation errors and retry attempts; ensure prompt compliance.
Database connectivity: Confirm MongoDB connection string and collection initialization.
Webhook endpoints: Check VAPID keys and CORS configuration for push endpoints.
Duplicate notices: DatabaseService deduplicates by ID; ensure consistent notice IDs.
The email processing and classification system uses LLM prompts to extract and structure notices from incoming emails. It integrates with the email client, deduplicates notices by ID, and delivers notifications over Telegram and web push, while keeping each responsibility in a separate, testable service.
Data Model: ExtractedNotice and NoticeDocument
ExtractedNotice: Fields include is_notice, rejection_reason, title, content, type, source, deadline, links, additional_info, and type-specific fields (students, company_name, role, package, location, eligibility_criteria, hiring_flow, job_type, event_name, topic, theme, speaker, date, time, registration_link, start_date, end_date, registration_deadline, prize_pool, team_size, organizer).
NoticeDocument: Stored representation with id, title, content, author, type, source, formatted_message, createdAt, updatedAt, sent_to_telegram, time_sent, deadline, links, students, students_count.
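To make the relationship between the two shapes concrete, here is a partial sketch. The field names are taken from the lists above, but the types, the defaults, the subset of fields shown, and the `to_notice_document` helper are all assumptions; the project may well define these differently (for example as Pydantic models).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ExtractedNotice:
    # Subset of the listed fields; types are assumed, not confirmed.
    is_notice: bool
    rejection_reason: Optional[str] = None
    title: str = ""
    content: str = ""
    type: str = ""
    source: str = ""
    deadline: Optional[str] = None
    links: list = field(default_factory=list)
    students: list = field(default_factory=list)  # element shape unknown


def to_notice_document(
    notice: ExtractedNotice, notice_id: str, author: str, formatted_message: str
) -> dict:
    """Map an extracted notice to the stored NoticeDocument shape."""
    now = datetime.now(timezone.utc)
    return {
        "id": notice_id,
        "title": notice.title,
        "content": notice.content,
        "author": author,
        "type": notice.type,
        "source": notice.source,
        "formatted_message": formatted_message,
        "createdAt": now,
        "updatedAt": now,
        "sent_to_telegram": False,
        "time_sent": None,
        "deadline": notice.deadline,
        "links": list(notice.links),
        "students": list(notice.students),
        "students_count": len(notice.students),
    }
```

Deriving `students_count` from `students` at write time keeps the two fields consistent without a second source of truth.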
Classification Criteria and Notice Types
Notice types supported: announcement, hackathon, job_posting, shortlisting, update, webinar, reminder, internship_noc.
Classification is LLM-based; irrelevant/spam/placement-offer emails are rejected early.
Category classification for formatting: update, shortlisting, announcement, hackathon, webinar, job posting.
Priority Assignment and Filtering
Priority is implicit: placement offers are handled by a separate pipeline; general notices are processed after placement offers in the orchestrated email update flow.
Filtering: LLM determines relevance; validation ensures minimal quality; deduplication prevents repeated notifications.
Integration with Email Clients and Authentication
GoogleGroupsClient authenticates via IMAP and app password; supports fetching unread emails, extracting forwarded metadata, and marking read/unread.
Configuration: Environment variables for placement email and app password.
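A minimal sketch of loading the credentials and opening an IMAP session follows. The environment variable names `PLACEMENT_EMAIL` and `PLACEMENT_APP_PASSWORD` and the host `imap.gmail.com` are assumptions (the document names neither); check the project's configuration for the real values.

```python
import imaplib
import os
from typing import Mapping

# Hypothetical variable names and host, used here for illustration only.
EMAIL_VAR = "PLACEMENT_EMAIL"
PASSWORD_VAR = "PLACEMENT_APP_PASSWORD"
IMAP_HOST = "imap.gmail.com"


def load_imap_credentials(env: Mapping[str, str] = os.environ) -> tuple[str, str]:
    """Read the email address and app password, failing fast if absent."""
    missing = [name for name in (EMAIL_VAR, PASSWORD_VAR) if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env[EMAIL_VAR], env[PASSWORD_VAR]


def open_mailbox(env: Mapping[str, str] = os.environ) -> imaplib.IMAP4_SSL:
    """Log in over IMAP with TLS and select the inbox."""
    user, password = load_imap_credentials(env)
    connection = imaplib.IMAP4_SSL(IMAP_HOST)
    connection.login(user, password)
    connection.select("INBOX")
    return connection
```

Failing before the network call turns a missing variable into a clear configuration error instead of an opaque IMAP authentication failure.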
Batch Processing Workflows
CLI orchestrator: Iterates unread emails, tries placement offer detection first, then general notice extraction, persists results, and marks as read.
NotificationRunner: Sends unsent notices via Telegram and/or Web Push channels.
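The CLI orchestrator loop described above can be sketched with injected callables. `process_unread` and its parameters are hypothetical; in particular, marking skipped emails as read and leaving failed ones unread is an assumption about the desired behavior, since the document only says emails are marked as read on success.

```python
from typing import Callable, Iterable


def process_unread(
    email_ids: Iterable[str],
    try_placement_offer: Callable[[str], bool],
    try_general_notice: Callable[[str], bool],
    mark_as_read: Callable[[str], None],
) -> dict[str, str]:
    """Process emails one by one and report an outcome per email id."""
    outcomes: dict[str, str] = {}
    for email_id in email_ids:
        try:
            # Placement offers take precedence over general notices.
            if try_placement_offer(email_id):
                outcomes[email_id] = "placement_offer"
            elif try_general_notice(email_id):
                outcomes[email_id] = "notice"
            else:
                outcomes[email_id] = "skipped"
        except Exception:
            # Leave the email unread so a later run can retry it.
            outcomes[email_id] = "failed"
            continue
        mark_as_read(email_id)
    return outcomes
```

Passing the steps in as callables keeps the loop testable without an IMAP server, an LLM, or a database.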
Spam Detection and Duplicate Filtering
Spam detection: LLM prompt explicitly rejects spam or irrelevant content.
Duplicate filtering: DatabaseService checks existence by ID before insertion.
Content Enrichment and Formatting
NoticeFormatterService enriches notices with job matching and formats Telegram-ready messages with IST timestamps and attribution.
Email date normalization: Forwarded dates and email Date headers are normalized to IST.